<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>Non-uniform file encodings in the Eclipse Platform</title>
  <link rel="stylesheet" href="../default_style.css" type="text/css">
</head>
<body text="#000000" bgcolor="#ffffff">
<h1>Non-uniform file encodings in the Eclipse Platform</h1>
<p><font size="-1">Last modified: February 23, 2004</font> </p>
<cite><strong>Plan item description:</strong> Eclipse 2.1 uses a single
global file encoding setting for reading and writing files in the
workspace. This is problematic; for example, when Java source files in
the workspace use OS default file encoding while XML files in the
workspace use UTF-8 file encoding. The Platform should support
non-uniform file encodings. [Platform Core, Platform UI, Text, Search,
Compare, JDT UI, JDT Core] [Theme: User experience] (bug <a
 href="http://bugs.eclipse.org/bugs/show_bug.cgi?id=37933">37933</a>, <a
 href="http://dev.eclipse.org/bugs/show_bug.cgi?id=5399">5399</a>) </cite>
<p>The pre-M7 situation is as follows:</p>
<ul>
  <li><code>ResourcesPlugin.getEncoding</code> returns the default
encoding for the workspace (the <code>org.eclipse.core.resources.encoding</code>
preference value if available, otherwise the value of the <code>file.encoding</code>
Java system property).</li>
  <li><code>IFile.getContents</code>/<code>setContents</code> work with
byte streams - no encoding can be applied.</li>
  <li><code>IFile.getEncoding</code> tries to guess the file encoding
(by looking for the <a
 href="http://www.unicode.org/unicode/faq/utf_bom.html">Byte Order Mark</a>),
which is not sufficient. This API also has no known clients so far
and would be deprecated.</li>
  <li>the Java compiler supports non-uniform encodings for Java source
files, but in Eclipse it relies on <code>ResourcesPlugin.getEncoding</code>
(same value for all sources).</li>
  <li>the text editor framework supports setting the encoding for files
being edited (setting a persistent property on the file resource), but
there is no support for setting the encoding of multiple files
simultaneously, and other components are not aware of the encoding
settings.</li>
</ul>
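<p>To illustrate why looking only for a Byte Order Mark is not enough, here is a
minimal sketch of such a check. The class <code>BomSniffer</code> and its method
are hypothetical names and not part of any Eclipse API. The check can only
answer for files that actually start with a BOM, and most files (including
every file in a legacy 8-bit encoding) do not.</p>

```java
/** Hypothetical helper, not Eclipse API: guesses an encoding from a Byte Order Mark. */
final class BomSniffer {

    /**
     * Returns the charset name implied by a BOM at the start of the given bytes,
     * or null if no BOM is recognized. UTF-32 BOMs are deliberately ignored here.
     */
    static String detect(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2) {
            int b0 = head[0] & 0xFF;
            int b1 = head[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE";
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE";
            }
        }
        // No BOM: the encoding cannot be determined this way.
        return null;
    }
}
```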
<h2>Requirements</h2>
<ul>
  <li>the encoding of a file should be automatically determined by
considering the file's content and/or its name or extension.</li>
  <li>encoding information should be available not only for workspace
resources (e.g. IFiles) but for external files too. This has become
more important in light of the recent RCP work since use of the
resource plugin has become optional.</li>
  <li>encoding information should be available for local history
contents (IFileState) and archives (e.g. *.zip, and *.jar).</li>
  <li>in the future it should be possible to use the content-based
encoding interpreter for determining more information about the file
(e.g. content type) without duplicating the mechanism, but rather by
augmenting it.</li>
  <li>users should be able to set the default encoding for a project.</li>
  <li>users should be able to share the default encoding settings in a
team
repository (metadata should reside in the project content area).</li>
  <li>an encoding determined from the file contents takes precedence over
the inherited encoding setting.</li>
  <li>users should be able to easily store a file in a different
encoding (in order to change its encoding).</li>
</ul>
<h2>Proposed solution</h2>
In addition to the existing approach of having a single global encoding for a 
workbench, we propose:
<ol>
  <li>an extensible mechanism to determine the encoding of a stream by
analyzing its contents or, if available, its file name,</li>
  <li>to add a default encoding property to projects. This default
encoding is used if no encoding could be determined in the first step.</li>
</ol>
We do not (yet) propose a settable encoding attribute per file because 
<ul>
  <li>we do not see an immediate need for this fine level of
granularity,</li>
  <li>Eclipse has no sharable file attributes, which would make it
difficult to share files with different encodings.</li>
</ul>
The encoding for a stream or an IStorage (as returned by the two <code>getCharset</code> 
methods; see API changes) will be: 
<ol>
  <li>the encoding discovered by a <em>content interpreter</em> associated
    with the file extension (or file type), if one exists <em>and</em> can
    determine the encoding, or</li>
  <li>the default encoding defined for the enclosing project, if any, or</li>
  <li>the global workspace encoding (equivalent to
ResourcesPlugin.getEncoding()).</li>
</ol>
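<p>The lookup order above can be sketched as a simple fallback chain. This is
only an illustration: the class and method names are hypothetical, and
<code>null</code> is assumed to mean "no value available" for the first two
inputs.</p>

```java
/** Hypothetical sketch of the three-step lookup; none of these names are Eclipse API. */
final class CharsetResolution {

    /**
     * @param detected         encoding found by a content interpreter, or null
     * @param projectDefault   default encoding of the enclosing project, or null
     * @param workspaceDefault global workspace encoding (assumed to be always set)
     */
    static String resolveCharset(String detected, String projectDefault, String workspaceDefault) {
        if (detected != null) {
            return detected;          // 1. content-based result wins
        }
        if (projectDefault != null) {
            return projectDefault;    // 2. project default
        }
        return workspaceDefault;      // 3. global workspace encoding
    }
}
```

<p>For example, <code>CharsetResolution.resolveCharset(null, "MS932", "Cp1252")</code>
returns <code>"MS932"</code>, because no content-based result is available and
the project default is set.</p>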
<p>Regarding #1, an extension point would allow file format-aware
encoding interpreters to register with the encoding discovery mechanism
for specific file types (extensions), or to associate existing encoding
interpreters with their own file extensions. Users would be able to
associate additional file extensions with the known interpreters (via a
preference).</p>
<p>All clients, when creating character-based streams when
reading/writing the contents of a file resource, should pass along the
charset string obtained from one of the <code>getCharset</code>
methods instead of
the one provided by <code>ResourcesPlugin.getEncoding</code>. Examples
are: text editors, compiler, search, compare. </p>
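<p>As a rough sketch of what this means for a client: the charset name is
passed explicitly when the character-based stream is created, instead of
relying on a global default. The helper below is hypothetical; the
<code>charsetName</code> argument stands in for the string a
<code>getCharset</code> method would return.</p>

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

/** Hypothetical client-side helper, not Eclipse API. */
final class CharsetAwareReading {

    /** Reads the whole stream as text, decoding it with the given charset name. */
    static String readAll(InputStream in, String charsetName) throws IOException {
        StringBuilder text = new StringBuilder();
        // The charset is supplied explicitly rather than taken from a global setting.
        try (Reader reader = new InputStreamReader(in, charsetName)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }
}
```

<p>The same bytes decode differently depending on the charset passed in, which
is why every client needs to use the per-file value.</p>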
<h3>API changes</h3>
<h4>Added:</h4>
To make the encoding support available for non-workspace based
resources we propose to add the following method to
org.eclipse.core.runtime.IPlatform:<br>
<pre>public interface <span style="font-weight: bold;">IPlatform</span> {
    // ...
    public String <span style="font-weight: bold;">getCharset</span>(InputStream stream, String fileExtension) throws CoreException;
    // ...
}
</pre>
<p>The InputStream seems to be the most widely used and scalable
mechanism to get access to any kind of byte content. InputStreams can
easily be created for a java.io.File, an IStorage (which subsumes IFile
and IFileState; see below), as well as for bytes in memory
(ByteArrayInputStream).<br>
The optional file extension argument can be used to quickly rule out more
expensive ways of inferring the encoding from the contents.<br>
</p>
A corresponding implementation (based on IContentInterpreters; see
below) lives in org.eclipse.core.runtime.Platform.<br>
<br>
For the resource plugin we propose to add a new interface
IEncodedStorage that adds the single method getCharset to the existing
IStorage interface:<br>
<pre>interface <span style="font-weight: bold;">IEncodedStorage</span> extends IStorage {
    public String <span style="font-weight: bold;">getCharset</span>() throws CoreException;
}
</pre>
<p>Its method getCharset returns the name of the encoding for an
IStorage. It would make sense to add this method directly to the
IStorage interface, since any InputStream can only be interpreted
correctly if the used encoding is known. But because clients are
allowed to implement IStorage this would be a breaking API change, so
we decided to introduce a separate extension to IStorage. <br>
</p>
<p>Two existing interfaces will extend IEncodedStorage: IFile and
IFileState. Two concrete classes will provide an implementation: File and
FileState.<br>
</p>
<p>For both files and file states, the implementation of getCharset
first uses IPlatform.getCharset(...) from above to find an encoding
based on any registered IContentInterpreters. If no encoding can be
determined, File.getCharset() locates the enclosing project of the file
and queries its IProjectDescription for a default encoding. For this we
need the following two new methods on IProjectDescription:<br>
</p>
<pre>interface <span style="font-weight: bold;">IProjectDescription</span> {
    // ...
    public String <span style="font-weight: bold;">getDefaultCharset</span>();
    public void <span style="font-weight: bold;">setDefaultCharset</span>(String charset);
    // ...
}
</pre>
If no default encoding has been defined for the project, the workspace's
default encoding preference is returned (via the existing API).<br>
<br>
Other implementers of IStorage will have to decide whether they should
base their implementation on IEncodedStorage.<br>
<br>
The implementation of Platform.getCharset will make use of content interpreters 
that implement the IContentInterpreter interface and can be associated with file 
types through a new Core Runtime extension point &quot;org.eclipse.core.runtime.contentInterpreter&quot;. 
Users can associate additional file extensions via preferences.<br>
<p>The method interpretContent does not return the detected encoding but stores 
  it in a result object of type IContentInfo that is passed in as an argument. 
  This approach makes it possible to collect additional information 
  (like 'type'/'subtype') instead of just the encoding.</p>
<pre>interface <span style="font-weight: bold;">IContentInterpreter</span> {    
	public void <span style="font-weight: bold;">interpretContent</span>(IContentInfo result, InputStream contents);
}</pre>
The IContentInfo is:
<pre>public interface <span style="font-weight: bold;">IContentInfo</span> {
	public void <span style="font-weight: bold;">setCharset</span>(String charset);
	public String <span style="font-weight: bold;">getCharset</span>();
}</pre>
Since we would not allow clients to implement (or extend) IContentInfo,
we will be able to extend the API with new setters and getters in the
future without breaking API. <br>
<p>The platform itself would provide implementations of
IContentInterpreter for XML and other popular file formats.<br>
</p>
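<p>As an illustration of what such an interpreter could look like, the sketch
below reads the <code>encoding</code> pseudo-attribute from an XML declaration.
The two interfaces are local stand-ins that mirror the proposed signatures;
<code>SimpleContentInfo</code> and <code>XmlDeclarationInterpreter</code> are
hypothetical names. The sketch assumes an ASCII-compatible encoding and does
not handle UTF-16 encoded documents.</p>

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Local stand-ins mirroring the proposed interfaces; not the real Eclipse types.
interface IContentInfo {
    void setCharset(String charset);
    String getCharset();
}

interface IContentInterpreter {
    void interpretContent(IContentInfo result, InputStream contents);
}

/** Hypothetical plain holder for the interpretation result. */
final class SimpleContentInfo implements IContentInfo {
    private String charset;
    public void setCharset(String charset) { this.charset = charset; }
    public String getCharset() { return charset; }
}

/** Hypothetical interpreter that looks for encoding="..." in the XML declaration. */
final class XmlDeclarationInterpreter implements IContentInterpreter {
    private static final Pattern ENCODING =
            Pattern.compile("<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._-]+)[\"']");

    public void interpretContent(IContentInfo result, InputStream contents) {
        try {
            // Only the start of the stream is inspected; a single read may return
            // fewer bytes than requested, which is acceptable for this sketch.
            byte[] head = new byte[256];
            int count = contents.read(head);
            if (count <= 0) {
                return; // empty stream: nothing can be determined
            }
            // ISO-8859-1 maps each byte to one char, so ASCII text survives intact.
            String prefix = new String(head, 0, count, StandardCharsets.ISO_8859_1);
            Matcher matcher = ENCODING.matcher(prefix);
            if (matcher.find()) {
                result.setCharset(matcher.group(1));
            }
        } catch (IOException e) {
            // Leave the result untouched so that callers fall back to defaults.
        }
    }
}
```

<p>If no declaration is found, the result object is left unchanged, so the
caller can fall back to the project or workspace default.</p>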
<h4>Deprecated:</h4>
<span style="font-family: monospace;">public int IFile.getEncoding()</span><br>
<span style="font-family: monospace;">public int IFile.ENCODING_</span>*
constants<br>
<br>
<code>public String ResourcesPlugin.getEncoding()</code>: Since all clients of this
method will most likely have to adapt their code, we suggest
deprecating getEncoding() and introducing a new method getDefaultCharset()
that better reflects its real purpose (and brings it in line with
IProjectDescription.getDefaultCharset()).<br>
<br>
<h3>UI Changes</h3>
We need to add new UI for changing the default encoding of a project.
A good place for this would be the property dialog, since the encoding
can be considered a property of the project, similar to the read-only
property. The property dialog for files would only show the current
value of the encoding and would not allow changing it. <br>
<br>
We should provide a &quot;<span style="font-style: italic;">Convert Encoding</span>&quot; 
action that converts the contents of a file (or all files in a hierarchy) to a 
different encoding. This action would ask the user for two encodings: the first 
is used when reading all selected files and the second when writing these files 
back to the workspace. <br>
The action would not
change the encoding value returned by getCharset() but it would provide
a means to make the encoding of multiple files consistent with the
default encoding of the enclosing project.<br>
(An alternative to this UI would be to provide something like a &quot;Save with 
encoding&quot; action for editors. But this UI seems to be less convenient if 
the encoding of multiple files needs to be changed).<br>
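<p>At the byte level, the conversion performed by such an action amounts to
decoding with the first encoding and re-encoding with the second. The sketch
below shows this for a single in-memory file; <code>EncodingConverter</code>
is a hypothetical name, and a real action would also have to deal with
characters that the target encoding cannot represent.</p>

```java
import java.nio.charset.Charset;

/** Hypothetical helper illustrating the core of a "Convert Encoding" action. */
final class EncodingConverter {

    /**
     * Decodes the input using fromCharset and re-encodes the text using toCharset.
     * Characters that cannot be mapped to the target charset are replaced by the
     * charset's default replacement (typically '?'), so the conversion can be lossy.
     */
    static byte[] convert(byte[] input, String fromCharset, String toCharset) {
        String text = new String(input, Charset.forName(fromCharset));
        return text.getBytes(Charset.forName(toCharset));
    }
}
```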
<br>
In order to make sharing of files with heterogeneous encodings easier,
we'll have to enhance the compare/merge tools to be able to work with
heterogeneous encodings.<br>
<br>
To facilitate that, we try to automatically determine the encoding for
the remote resource<br>
<ul>
  <li>by knowing the default encoding of the remote project if the
.project file (containing the encoding attribute) has been synched
first, or<br>
  </li>
  <li>by using the local IContentInterpreter mechanism for the remote
resources (which are available as streams), or<br>
  </li>
  <li>by allowing the user to change the encoding for the remote
resource on the fly until it displays correctly.</li>
</ul>
With these means it becomes possible to compare and merge files
regardless of whether both sides use the same encoding.<br>
<br>
However, if we want to use the same encoding (that is, if we catch up with the remote 
.project file), we will have to convert our local files to the new encoding. 
For this we will provide the &quot;<span style="font-style: italic;">Convert 
Encoding</span>&quot; action in the Compare/Merge tools where required.<br>
<br>
<h3>Scenarios</h3>
<ul>
  <li>The user opens text files whose contents were created using encoding &quot;MS932&quot; 
    in a workspace whose default encoding is &quot;US-ASCII&quot;. It was not 
    possible to guess the file encoding automatically, so what the user sees is 
    gibberish. The user figures out the cause of the problem and explicitly sets 
    the encoding for the project containing the files to &quot;MS932&quot;. He 
    will have to reload all editors to see the contents correctly and will have 
    to trigger a full build in the affected project. </li>
  <li>The user has a Java project with a few Java files, no explicitly specified 
    project encoding, and a CP1252 workspace encoding. Now he wants to start using 
    all kinds of Unicode characters in his Java files. He sets the default encoding 
    of his project to UTF-16 and converts all existing Java files to the UTF-16 
    encoding. All newly created Java files will automatically have the correct 
    encoding. Project metadata files like &quot;.project&quot; or &quot;plugin.xml&quot; 
    files will still be read in their correct encoding since IContentInterpreters 
    still apply to them.</li>
  <li>Determining the encoding to use for newly created files: Normally the encoding 
    becomes relevant on saving the file for the first time. Since the IFile already 
    exists and knows its project, the encoding to use can be determined by the 
    proposed API. A potential problem might arise from the fact that a newly created 
    file should use an encoding that is consistent with the encoding it would 
    get from an IContentInterpreter. Examples are *.properties files (read as 
    ISO-8859-1 by java.util.Properties) and *.xml files (UTF-8 unless declared 
    otherwise). Their encoding is determined by the file format even if the 
    enclosing project uses a different encoding. The code that writes these files 
    to disk (and defines the initial encoding) must understand this. It can neither 
    use the encoding for the project, nor can it use the encoding for the file 
    (because the file is still empty when the IContentInterpreter tries to 
    determine its encoding).<br>
  </li>
</ul>
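<p>One way the code that creates new files could handle the last scenario is to
consult a table of file types with a format-defined encoding before falling back
to the project default. The following is only a sketch under that assumption;
<code>NewFileCharsetChooser</code> and its methods are hypothetical, and the
proposal itself does not prescribe such a table.</p>

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper, not Eclipse API: picks the initial encoding for a new, empty file. */
final class NewFileCharsetChooser {
    // File extensions whose format dictates the encoding, e.g. "xml" -> "UTF-8".
    private final Map<String, String> fixedByExtension = new HashMap<>();

    void registerFixedCharset(String extension, String charset) {
        fixedByExtension.put(extension.toLowerCase(Locale.ROOT), charset);
    }

    /**
     * Returns the format-defined charset for the file's extension if one is
     * registered, otherwise the default charset of the enclosing project.
     */
    String charsetForNewFile(String fileName, String projectDefault) {
        int dot = fileName.lastIndexOf('.');
        String extension = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String fixed = fixedByExtension.get(extension);
        return fixed != null ? fixed : projectDefault;
    }
}
```

<p>With <code>"xml"</code> registered as <code>"UTF-8"</code>, a new
<code>plugin.xml</code> in a UTF-16 project would be written as UTF-8, while a
new <code>Foo.java</code> would use the project default.</p>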
</body>
</html>
